The application of data mining analyses (DM) is effective for the quantitative classification of human intestinal
microbiota (HIM). However, various technical problems remain to be overcome. This paper deals with one of the major
ones, the number of nominal partitions (NP) of the target dataset. We used terminal restriction fragment length
polymorphism data obtained from the feces of 92 Japanese men. The data comprised operational taxonomic units (OTUs)
and the smoking and drinking habits of the subjects, which had been effectively classified with two NPs (2-NP; Yes
or No). Using the same OTU data, 3-NP and 5-NP were examined here, and the accuracy of prediction and the reliability
of the OTUs selected by DM were compared with those of the former 2-NP. Accuracy was also affected by the restriction
enzymes used, and 7 enzymes were compared. Some subjects possessed HIM at the border zones of partitions, and the
greater the number of partitions, the lower the DM accuracy obtained. The application of balance nodes, which boost
and duplicate the data, improved accuracy. More accurate and reliable DM operations are applicable to the
classification of unknown subjects for identifying various characteristics, including disease.

Key words: human intestinal microbiota, operational taxonomic unit, data mining analysis, decision tree, nominal 
partitions of data, accuracy of classification, balance node 



INTRODUCTION 

Human intestinal microbiota (HIM) is related to our
health, and practical research on its relationship with
the human immune system and diseases is now being
widely performed. Our previous papers [1-3] applied
data mining analysis (DM) to HIM data for quantitative
classification of the relationship between HIM and
subject characteristics. The results were fruitful,
but because the application of DM to HIM is novel,
more case studies need to be accumulated for further
DM operations. The selection of primer-restriction
enzymes and the number of nominal partitions (NP) of
the assigned characteristics are important factors for
reliable applications. This paper aims to compare the
effects of both factors on obtaining accurate and
dependable DM results, which are major technical
problems in practical applications.

The number of NPs, i.e., the number of partitions of an
assigned characteristic, depends on the purpose of the
analysis and directly affects the accuracy of the DM
results. In other words, the NP must be chosen
appropriately for the data. Our previous paper [3]
dealt with a simple 2 nominal partition (2-NP), i.e.,
Yes or No, and compared the accuracy among the 7
restriction enzymes. Here, we examine 2 further types
of NP, 3-NP and 5-NP, and compare the 3 types of NP,
including 2-NP, for 2 characteristics, as shown in
Table 1. The 2-NP results were reported previously [3],
but are included here for comparison. The original
operational taxonomic unit (OTU) data applied in this
paper were the same as reported in our previous papers
[1-4], but the detailed NPs are different. As in the
previous paper [3], dietary factors for the healthy male
subjects were controlled, which is an important starting
point for the quantitative analysis of HIM.

HIM are represented here as OTUs by terminal 






T. Kobayashi, et al. 



Table 1. Nominal partitions (NP), 2 to 5, of the 92 subjects

Characteristic  NP    Mark  Area / meaning                       N. of subjects
Smoking         2-NP  SA    No, non-smoker + non-respondent      76
                      SB    Yes, smoking now                     16
                3-NP  SAA   non-smoker + non-respondent          57
                      SAP   all previous smokers, not now        19
                      SBB   all present smokers                  16
                5-NP  SAA   non-smoker + non-respondent          57
                      SPA   previous smoker, cess. P. ≥5Y        14
                      SP5   previous smoker, cess. P. <5Y        5
                      SBG   smoker, 15 cigarettes/d. or less     12
                      SBH   heavy smoker, 16 cigs./d. or more    4
Drinking        2-NP  DA    No, non-habitual drinker             47
                      DB    Yes, habitually drinking now         45
                3-NP  DA    non-habitual drinker                 47
                      DBS   habitual drinker, 1-3 days/w.        21
                      DBF   habitual drinker, 4-7 days/w.        24
                5-NP  DA    non-habitual drinker                 47
                      DBL   drinker, average <20 ml-AlOH/d.      11
                      DB2   drinker, average 20-40 ml-A/d.       18
                      DB3   drinker, average 41-60 ml-A/d.       6
                      DBH   drinker, average >60 ml-A/d.         10

N.: number; cess. P.: smoking cessation period; 5Y: 5 years; w.: week;
d.: day; AlOH, A: alcohol; Shading at '2-NP' indicates that the results
have been reported previously [3], but are shown here for comparison
to 3-NP and 5-NP.



restriction fragment length polymorphism (T-RFLP)
analysis. The relationship between OTUs and subject
characteristics has previously been assessed by cluster
analysis, using the methods of Jin [4] and Andoh [5, 6],
or by Pearson correlation coefficients and principal
component analysis (PCA). To date, DM has been applied
to the relationships between genes, single nucleotide
polymorphisms (Merelli [7]) and inflammatory bowel
disease (Merelli [8]), as well as to age-dependent genes
(Kirschner [9]) and hormone levels (Modlin [10]), but
has not been applied to general HIM, i.e., OTUs.

OTUs are thought to contain numerous types of
bacteria, and their composition directly affects the
accuracy of DM classifications. We therefore applied
7 restriction enzymes for better comparisons of subject
classification. DM will be applied to classify OTU
data whose characteristics have various NPs, e.g.,
types or symptoms of diseases; thus, for effective DM
operation, systematic comparisons are required and are
examined here.

MATERIALS AND METHODS 

As reported previously [4], to avoid the influence 
of dietary factors, we designed identical meals (1,879 
kcal/day), which were fed for 3 days to 92 healthy male 
volunteers living in Japan. Age and body mass index 
(BMI) of the subjects were 21-59 years (average: 36.8
years) and 17.3-30.2 kg/m² (average: 22.6 kg/m²),
respectively. Fecal samples were analyzed by T-RFLP 
using 7 restriction enzymes [2, 4]. T-RFLP was applied 
due to its reproducibility, comparatively low cost and 
convenience with regard to DM operation. Studies were 
performed in accordance with the protocols approved 
by the Riken Research Ethics Committee (Wakou 2009- 
3rd 21-13), and the OTU data were accumulated by the 
Benno Laboratory, Riken, Japan. 

Bacterial DNA was isolated from feces using a
modification of the method described by Matsuki [11].
Amplification of fecal 16S rRNA, restriction enzyme
digestion, size fractionation of T-RFs and T-RFLP
analysis were carried out as described previously
[12-14]. Details of amplification and T-RFLP analysis
with the 7 restriction enzymes, i.e., 516f-BslI,
516f-HaeIII, 27f-MspI, 27f-AluI, 35f-HhaI, 35f-MspI
and 35f-AluI, were as described in our previous papers
[2, 4].

The amount of each OTU represents the fluorescence
intensity and concentration. The obtained OTU data
are abbreviated here as Bn (n: base pair number, e.g.,
B919) for 516f-BslI, HAn for 516f-HaeIII, Mn for
27f-MspI, An for 27f-AluI, QHhn for 35f-HhaI, QMn for
35f-MspI and QAn for 35f-AluI. We had 2 groups of
OTUs: 516f- + 27f- (4 restriction enzymes), and 35f-
(3 restriction enzymes). The component numbers of these
7 enzyme groups were 27 B, 33 HA, 20 M, 40 A, 31 QHh,
34 QM and 48 QA; thus, if we combined all the enzyme
components within each of the 2 groups, the former had
a maximum of 120 OTUs, and the latter had a maximum of
113 OTUs. On account of the balance between the number
of subjects (92) and OTU components, we did not mix the
data from the 2 groups, to avoid the problem of field
alignment sequences described in previous reports
[2, 3]. Various sets of restriction enzymes were
combined, and the data were arranged with the answers
of the 92 subjects. The resulting 2-dimensional Excel
data were analyzed using
DM software (IBM-SPSS, Clementine 14).
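
The component counts above determine how many OTU columns each enzyme combination contributes to the dataset. The following is a minimal Python sketch of that bookkeeping, not part of the original analysis; the names `OTU_COUNTS`, `GROUPS` and `combined_otus` are hypothetical and introduced here for illustration only.

```python
# OTU component counts per primer-enzyme set, as given in the text.
OTU_COUNTS = {"B": 27, "HA": 33, "M": 20, "A": 40, "QHh": 31, "QM": 34, "QA": 48}

# The two primer groups that the text keeps separate (labels are illustrative).
GROUPS = {
    "516f-+27f-": {"B", "HA", "M", "A"},
    "35f-": {"QHh", "QM", "QA"},
}


def combined_otus(combo: str) -> int:
    """Return the number of OTU columns for a combination such as 'HA+M'.

    Raises ValueError if the combination mixes the two primer groups
    or names a set that is not listed in OTU_COUNTS.
    """
    parts = combo.split("+")
    for members in GROUPS.values():
        if all(p in members for p in parts):
            return sum(OTU_COUNTS[p] for p in parts)
    raise ValueError(f"{combo!r} mixes primer groups or names an unknown set")


print(combined_otus("HA+M"))       # 33 + 20
print(combined_otus("B+HA+M+A"))   # whole 516f- + 27f- group
print(combined_otus("QHh+QM+QA"))  # whole 35f- group
```

With these counts, HA+M gives the 53 OTUs used for Fig. 1 and B+HA+M gives the 80 OTUs used for Appendix Fig. A1.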

A DM algorithm, the Classification and Regression Tree
(C&RT) modeling system, which is the most typical
method of DM, provides a decision tree (Dt; footnote 1).
The Dt explicitly classifies the subjects into groups
according to the assigned characteristics shown in
Table 1. C&RT divides subjects into two subsets by
comparing the Gini coefficient (footnote 2) computed from
the OTU data, such that the subjects within each subset
are more homogeneous than in the parent set. Compared
with the other modeling systems of DM, the C&RT
system is flexible, and allows unequal misclassification
costs to be considered. A major feature of DM and
the constructed Dt is that a single selected OTU is used
[Figure: decision tree grown from the 1st to the 5th step. Node 0 (SA 76, SB 16; total 92) is split by HA291 at 3.13 into Node 1 (≤3.13: SA 68, SB 9; total 77) and Node 2 (>3.13: SA 8, SB 7; total 15); the remaining nodes are not reproduced here.]

Fig. 1. Decision tree (Dt) by 2-NP for smoking habit with 53 OTUs.

OTUs: 33HA+20M; marked as * in Table 2; large solid arrows: 7 nodes containing all 16 smokers, 'SB'; large dotted arrow: node of 63
nonsmokers, 'SA'.



for each step of the Dt. With its default settings, the
C&RT system grows a Dt to 5 steps. The balance nodes
applied here correct imbalances in the dataset, which
readily develop with higher NPs, so that the specified
test criteria are met and more accurate results are
obtained. If necessary, balancing is carried out by
boosting the occurrence of infrequent values at the time
of Dt construction.
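
To make the splitting rule concrete, the sketch below computes the Gini impurity of a node and the impurity reduction of a single-OTU cut-off. It is a simplified Python illustration of a C&RT-style split, not the Clementine implementation; the function names are hypothetical, and the OTU values 1.0 and 5.0 are placeholders chosen only to fall on either side of the 3.13 cut-off. The node counts are those of Node 0, Node 1 and Node 2 in Fig. 1.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_improvement(values: Sequence[float], labels: Sequence[str],
                      cutoff: float) -> float:
    """Impurity reduction when subjects are split at `value <= cutoff`."""
    left = [y for x, y in zip(values, labels) if x <= cutoff]
    right = [y for x, y in zip(values, labels) if x > cutoff]
    n = len(labels)
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(labels) - weighted


def best_cutoff(values: Sequence[float], labels: Sequence[str]) -> float:
    """Try the midpoints between distinct sorted values; keep the best one."""
    xs = sorted(set(values))
    if len(xs) < 2:
        raise ValueError("need at least two distinct values to split")
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda c: split_improvement(values, labels, c))


# Node counts from Fig. 1; the OTU values themselves are placeholders.
values = [1.0] * 77 + [5.0] * 15
labels = ["SA"] * 68 + ["SB"] * 9 + ["SA"] * 8 + ["SB"] * 7
print(round(gini(labels), 4))                             # impurity of Node 0
print(round(split_improvement(values, labels, 3.13), 4))  # gain of the 1st split
```

A tree builder would repeat this search over every OTU at every node and keep the OTU and cut-off with the largest improvement, which is why only a single OTU appears at each split.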

RESULTS AND DISCUSSION 

Comparison of NPs and restriction enzymes 

The Dt produced with 2-NP is shown in Fig. 1 as a
simple example that is easy to understand and saves
space; the smoking habit of the subjects was explicitly
classified into several nodes by certain OTUs. Applying
3-NP and 5-NP as shown in Table 1, the subjects were
divided according to the various purposes of DM
analysis. Here, the number of partitions was limited to
5 because of the small size of the smallest subject
groups in Table 1 (4 for SBH, 5 for SP5 and 6 for DB3).
Appendix Fig. A1 shows an actual Dt with 5-NP for
smoking habit; most of the results with higher NPs
required too much space to show.

The details of the Dt and the pathways to the
terminal nodes (footnote 3) in these figures clearly show
which OTUs played a role in dividing the subjects into
the various nodes. The Dt also provides quantitative
cut-off values: the 92 men were divided into 2 subsets
at the 1st step by HA291, at the left end of Fig. 1, and
were subsequently divided further. The cut-off at the
1st step was 3.13 for HA291, and the cut-off at the
lower 2nd step was 4.28. A notable feature of this Dt
was that only 7 of the 53 OTUs were active (2 of them,
HA291 and HA83, being applied twice), which indicated
that the remaining 46 OTUs were neglected in
constructing this Dt. In other words, the 7 OTUs were
closely related to the smoking characteristics of the
subjects, and the other 46 OTUs were recognized as
unrelated to smoking. When comparing Fig. 1 with the
similar result in the previous report [3] (Fig. 1),
which had applied 80 OTUs (27 BslI + 33 HaeIII + 20
MspI), HA291 had the same cut-off value at the 1st
step, but the OTUs after the 2nd step were different.
In addition, 2 wrongly classified subjects were observed
in Fig. 1. These were the effects of the applied OTU
combinations, but we focused only on the accuracies of
the Dt up to the 5th step. With regard to smoking
habit, Biedermann et al. [15] recently examined the
effects of smoking cessation in 5 subjects, as compared
with 10 controls, by T-RFLP and PCA. Their results
showed an increase in Firmicutes and Actinobacteria and
a lower proportion of Bacteroidetes and Proteobacteria
at the phylum level.

Similarly, in Appendix Fig. A1, 12 OTUs out of 80
were selected to construct the Dt, including HA83. The
92 men were divided into 2 subsets at the 1st step by
B369, at the left end of Appendix Fig. A1. The cut-off
at the 1st step was 1.17 for B369; the upper 2nd node
(Node-1) included 86 subjects, and the lower 2nd node
(Node-2) only 6. This selection of OTUs is the main
difference from the former classification methods for
HIM, such as clustering, PCA and Pearson correlation
coefficients, which considered all OTU data without any
selection, so that the results inevitably became
obscure. Table 2 compares the results for 2-NP, 3-NP
and 5-NP for smoking habit, with several combinations
of the 7 restriction enzymes. Similarly, Table 3 shows
the results for drinking habit. Both tables list the
OTU at the 1st step and the number of wrongly
classified subjects among the 92. The latter indicates
the accuracy for each set of NP and restriction
enzymes, the best value of which is 0.

Tables 2 and 3 show that accuracy is closely related
to the combination of restriction enzymes, not only
horizontally in the tables, but also vertically, with
the same restriction enzymes and different NPs. The
best accuracies were obtained by combinations that
shared the same OTU at the 1st step. Higher NPs gave
worse accuracy, with the exception of smoking at 3-NP
with 3 combinations of restriction enzymes, i.e.,
QHh+QM, QHh+QM+QA and QM+QA+QHh (marked as &2, &4 and
&5 in the lower middle of Table 2), where only 1
subject was misclassified. Comparing the 2 restriction
enzyme groups, i.e., 516f- + 27f- and 35f-, the former
generally seemed to have slightly better accuracy than
the latter. Typical OTUs such as HA291 for heavy
smokers [1-3] were only observed at 2-NP for smoking in
Table 2, but A47 for drinking was widely obtained at
the 1st step in Table 3. Comparing the 2
characteristics, i.e., smoking and drinking, the former
was rather easier to classify than the latter, as
was previously reported [3] for 2-NP only.

Detailed aspects of better accuracy 

To trace the exceptional and better cases marked as
&1 to &5 in the lower half of Table 2, Table 4 shows
the detailed components of the Dt from the 1st step to
the 3rd step. In all 5 cases, the OTU at the 1st step
was the same, QM134, which is associated with the
exceptional accuracy. The reason why these cases gave
such results lies in the structure of the Dt
configurations. The 3 cases with the best values, i.e.,
&2, &4 and &5, had the same OTUs up to the 2nd step and
slightly different OTUs at the 3rd step. Furthermore,
the restriction enzyme sets in these 3 cases, i.e.,
QHh+QM, QHh+QM+QA and QM+QA+QHh, had a similar Dt
configuration up to the 5th step. Even though the
selected OTUs were different, the locations of missing
nodes were similar at the 4th and 5th steps, which are
not shown in Table 4. This suggests that the OTUs used
in each Dt from the 4th step onward were replaceable
with certain other OTUs, and that QA was less useful
for this classification than QHh and QM. Finally, QM134
played the most important role in subject
classification for smoking with 3-NP and the 3
restriction enzymes (35f-), while QHh178 and QHh574 at
the 2nd step played secondary roles.

Subject features for good classification 

Although the values for accuracy were simply compared
in Tables 2 and 3, each subject had their own individual
OTU features, which were classified with varying levels
of ease. In other words, some subjects might have
ambiguous or boundary features with respect to
classification. Thus, for single use of the 4
restriction enzymes, i.e., B, HA, M and A, with 3-NP
and 5-NP, the misclassified subjects were individually
traced and examined. The number of misclassified
subjects, the number of redundantly observed subjects,
the rate of subjects wrongly classified with both NPs
among the 92 and the rate of subjects who were always
properly classified are listed in Table 5. The latter
2 rates are of particular interest, because they can be
understood directly from the features of the OTUs.
Furthermore, the intermediary values between these 2
rates, i.e., 100 - 26.1 - 55.4 = 18.5% for smoking and
29.4% for drinking, represent the subjects with
intermediate features, who were classified properly at
either 3-NP or 5-NP but not both. These features were
closely connected with the smoking or drinking
characteristics. Differences and specificities were
clearly observed with the values in
Table 2. Comparison of nominal partitions for accuracy of DM and smoking habit

516f- + 27f- group

R.Enz.          B        HA       M        A        B+HA     HA+M     HA+A     B+HA+M   HA+M+B   A+M+HA   B+HA+M+A M+A+B+HA
2-NP  OTU       B919     HA291    M133     A87      HA291    HA291    HA291    HA291    HA291    HA291    HA291    HA291
      N. wrong  1        3        7        4        0        2 *      1        0 #1     0        1        1        1
3-NP  OTU       B494     HA995    M208     A80      HA995    HA995    HA995    HA995    HA995    HA995    HA995    HA995
      N. wrong  10       13       15       11       16       14       13       12 #2    12       13       12       12
5-NP  OTU       B494     HA995    M208     A238     HA995    HA995    HA995    HA995    B369     HA995    B369     B369
      N. wrong  17       20       14       19       21       9        16       21 #3    11 $     16       17       17

35f- group

R.Enz.          QHh        QM         QA         QHh+QM     QM+QA      QA+QHh     QHh+QM+QA  QM+QA+QHh
2-NP  OTU       QHh601     QM124      QA829      QM124      QM124      QA829      QM124      QM124
      N. wrong  7          3          7          2          4          4          4          4
3-NP  OTU       QHh601     QM134      QA131      QM134      QM134      QA131      QM134      QM134
      N. wrong  20         9 &1       15         1 &2       5 &3       16         1 &4       1 &5
5-NP  OTU       QHh728     QM134      QA131      QM134      QM134      QA131      QM134      QM134
      N. wrong  25         9          22         9          8          21         8          8

R.Enz.: primer restriction enzymes; N.: number; NP: nominal partition; OTU: OTU of the Dt 1st step; N. wrong: number of wrongly classified subjects
among 92 up to the 5th step (= accuracy); Combination of R.Enz. indicates the sequence in DM processing; *: detailed Dt is shown in Fig. 1; $: detailed
Dt is shown in Appendix Fig. A1; #1-#3: compared with balance nodes in Table 6; &1-&5: OTUs obtained up to the 3rd step are shown in Table 4; Shading
at '2-NP' indicates that the results have been reported previously [3], but are shown here for comparison to 3-NP and 5-NP.



Table 5: smoking was comparatively easy to classify,
whereas drinking was more ambiguous, which we
attribute to the physiological stresses on the subjects.
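
The rates quoted from Table 5 follow from simple arithmetic on the subject counts. The Python sketch below reproduces them under the assumption, made here for illustration, that each rate is rounded to one decimal place before the intermediate rate is taken as the remainder; the function name `table5_rates` is hypothetical.

```python
N_SUBJECTS = 92


def table5_rates(always_wrong: int, always_right: int, n: int = N_SUBJECTS):
    """Return (wrong %, right %, intermediate %) for one characteristic.

    always_wrong: subjects misclassified with both 3-NP and 5-NP.
    always_right: subjects properly classified with both 3-NP and 5-NP.
    The intermediate rate is the remainder after subtracting the two
    rounded rates from 100, as in the text.
    """
    wrong_pct = round(100 * always_wrong / n, 1)
    right_pct = round(100 * always_right / n, 1)
    middle_pct = round(100 - wrong_pct - right_pct, 1)
    return wrong_pct, right_pct, middle_pct


print(table5_rates(24, 51))  # smoking:  (26.1, 55.4, 18.5)
print(table5_rates(27, 38))  # drinking: (29.3, 41.3, 29.4)
```

The larger intermediate rate for drinking is the numerical form of the statement that drinking was more ambiguous to classify than smoking.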

Balance node, application of boosted apparent subjects 

As shown in Table 1, with a large NP, i.e., 5-NP, the
numbers of subjects in some components became small and
imbalanced, e.g., SP5, SBH and DB3. If the ratio of the
smallest to the largest component was less than 15%,
the obtained Dt was considered to be less stable,
because it shifted easily with a slight change in the
smallest component. To overcome this problem, the DM
software provides balance nodes, which boost and
duplicate subjects during Dt construction. Boosting
refers to the multiple utilization of minor data
components, which allows the total apparent data to be
balanced. The total number of apparent subjects
therefore increases according to the multiple rate
applied to each component. After the Dt was
constructed, the original data for the 92 subjects,
without any boosting, were applied to the obtained Dt,
and the accuracy was examined in the usual way. The
detailed mechanisms of boosting and preparing the
subjects are shown in Fig. 2 and Table 6 for the cases
of smoking marked as #2 and #3 in the upper middle part
of Table 2. Table 6 lists the original dataset without
boosting, the multiple rates for boosting with the
resulting numbers of apparent subjects, and the results
examined with the original dataset of 92 subjects.
Comparing the results in Table 6 with and without the
balance nodes, accuracy improved, particularly in the
case of the imbalanced dataset, i.e., 5-NP.
Table 3. Comparison of nominal partitions for accuracy of DM and drinking habit

516f- + 27f- group

R.Enz.          B        HA       M        A        B+HA     B+A      A+B      HA+M     M+A+B    M+A+HA   B+HA+M+A M+A+B+HA
2-NP  OTU       B657     HA130    M45      A47      B657     A47      A47      M45      A47      A47      A47      A47
      N. wrong  5        2        5        3        5        0        0        2        0        0        3        0
3-NP  OTU       B657     HA130    M45      A47      B657     A47      A47      M45      A47      A47      A47      A47
      N. wrong  10       7        7        12       8        6        6        7        6        7        7        7
5-NP  OTU       B657     HA194    M45      A47      HA194    A47      A47      HA194    A47      HA194    HA194    HA194
      N. wrong  21       21       15       20       21       21       21       21       16       17       17       16

35f- group

R.Enz.          QHh        QM         QA         QHh+QA     QA+QHh     QM+QA      QHh+QM+QA  QM+QA+QHh
2-NP  OTU       QHh601     QM194      QA422      QA422      QA422      QA422      QA422      QA422
      N. wrong  12         10         11         5          5          9          10         10
3-NP  OTU       QHh584     QM194      QA422      QA422      QA422      QA422      QA422      QA422
      N. wrong  19         19         26         19         19         14         14         14
5-NP  OTU       QHh601     QM194      QA422      QA422      QA422      QA422      QA422      QA422
      N. wrong  17         19         27         18         18         14         16         16

All notations are the same as in Table 2.



Table 4. Comparison of detailed Dts with better 3-NP, marked as &1 - &5 in Table 2

R.Enz.      OTU of Dt 1st step  OTUs of Dt 2nd step  OTUs of Dt 3rd step           N. wrong up to Dt 5th step
QM          QM134               QM134, QM124         QM124, QM194, QM544, QM83     9 &1
QHh+QM      QM134               QHh178, QHh574       QM194, QM171, QHh361, QHh555  1 &2
QM+QA       QM134               QA422, QA58          QM134, QM200, QM124, QM134    5 &3
QHh+QM+QA   QM134               QHh178, QHh574       QA237, QM171, QHh361, QA422   1 &4
QM+QA+QHh   QM134               QHh178, QHh574       QA237, QM171, QM124, QA422    1 &5

Smoking habit, 3-NP; N. wrong: number of wrongly classified subjects among 92; all of their Dt 1st steps: QM134; &2, &4 and &5 are
as described in the text; Other notations are the same as for Tables 1 and 2.



Table 5. Comparison of wrongly classified subjects with NPs and characteristics

Characteristic  NP    (a)  (b)  (c)         (d)
Smoking         3-NP  49   12   24 (26.1%)  51 (55.4%)
                5-NP  70   21
Drinking        3-NP  36   3    27 (29.3%)  38 (41.3%)
                5-NP  77   21

(a): total of wrongly classified subjects among 92 with single use of R.Enz. in Tables 2 and 3; (b): N. of subjects redundantly
observed in (a); (c): N. of subjects wrongly classified with both 3-NP and 5-NP in Tables 2 and 3; (d): N. of subjects who were
always properly classified with both 3-NP and 5-NP; Single use of 4 R.Enz.: B, HA, M and A; ( ): rates among 92 subjects (%).
Namely, the accuracy improved from 77.2% to 89.1%,
which shows the advantage of balance nodes. However,
in the case of 3-NP, whose components are less
imbalanced (smallest-to-largest ratio of 28.1%), the
improvement was slight, from 87.0% to 88.0%,
but the obtained Dt configuration was different.
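
The boosting step can be illustrated with a short Python sketch. It is only an approximation of what a balance node does: the source does not describe how the software duplicates records, so cycling through the subjects of a class until the target count is reached is an assumption made here, and the names `ratio_to_largest`, `boost_indices` and `accuracy_pct` are hypothetical. The subject counts and multiple rates are those listed in Table 6 for 5-NP.

```python
from collections import Counter

N_SUBJECTS = 92


def ratio_to_largest(counts):
    """Size of each component as a percentage of the largest component."""
    largest = max(counts.values())
    return {k: round(100 * v / largest, 1) for k, v in counts.items()}


def boost_indices(labels, factors):
    """Return row indices of the apparent (boosted) dataset.

    Each class is repeated until it holds round(count * factor) rows.
    Cycling through the rows is an assumed duplication rule.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    boosted = []
    for y, idx in by_class.items():
        target = round(len(idx) * factors[y])
        boosted.extend(idx[i % len(idx)] for i in range(target))
    return boosted


def accuracy_pct(n_wrong, n=N_SUBJECTS):
    """Accuracy (%) when the tree is checked against the original subjects."""
    return round(100 * (n - n_wrong) / n, 1)


# 5-NP smoking groups and multiple rates from Table 6.
labels = ["SAA"] * 57 + ["SPA"] * 14 + ["SP5"] * 5 + ["SBG"] * 12 + ["SBH"] * 4
factors = {"SAA": 1.0, "SPA": 4.0, "SP5": 11.0, "SBG": 5.0, "SBH": 14.0}

apparent = boost_indices(labels, factors)
print(ratio_to_largest(Counter(labels)))         # SBH is 7.0% of SAA
print(Counter(labels[i] for i in apparent))      # 57, 56, 55, 60, 56
print(len(apparent))                             # 284 apparent subjects
print(accuracy_pct(21), accuracy_pct(10))        # 77.2 and 89.1
```

The tree would be grown on the rows selected by `apparent`, while the accuracy is always computed on the 92 original subjects, which is why 21 and 10 misclassified subjects correspond to 77.2% and 89.1%.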

The OTUs obtained up to the 3rd step are also shown
in Table 6 for comparison. The configuration of the Dt
indicates the effects of balance nodes. First, the OTUs
obtained with the balance node differed from the
OTUs obtained with normal DM. Second, HA291, which was
recognized as the OTU most closely related to heavy
smokers [2, 3], appeared after the application of
balance nodes with 5-NP at the 1st step and the lower
2nd step (Table 6). This indicates that the Dt structure
after applying balance nodes is similar to the Dt with
2-NP, which is also shown in Table 6.
If the OTU dataset has imbalanced components, a more
stable Dt configuration and more stable OTUs are
obtained with the application of balance nodes. Wide
imbalances in a dataset, such as uneven components,
occur occasionally in HIM analyses with large NPs
(5 or more).

Effects of nominal partitions 

With regard to the selection of effective restriction
enzymes for obtaining accurate DM results, Tables 2
and 3 provide good examples for smoking and drinking
habits. Combinations of 2 to 3 restriction enzymes
gave better results. Furthermore, in these limited
cases, the 516f- + 27f- group exhibited better
results than the 35f- group.

Fig. 2. Flow chart for the use of balance nodes.

Flow: m. rate set for the balance node -> application of balance nodes -> DM with apparent subjects -> obtained Dt is checked
against the original dataset -> Dt and its accuracy with balance nodes.

m. rate: multiple rate for boosting data; Balance nodes are used
to correct imbalances in a target dataset. Practical criteria for
application were unclear between 10% and 20%. The applied
results are shown in Table 6.



Table 6. Application of balance nodes, accuracy and Dt configuration

Subjects, without balance node (real subjects) and with balance node (apparent subjects, only at Dt construction)

NP    Mark  Area / meaning                       N. of real  % to largest  m. rate to    N. of apparent
                                                 subjects    data          balance node  subjects
2-NP  SA    No, non-smoker + non-respondent      76          100
      SB    Yes, smoking now                     16          21.1
      N. of subjects at Dt constr.               92
3-NP  SAA   non-smoker + non-respondent          57          100           1.000         57
      SAP   all previous smokers, not now        19          33.3          3.000         57
      SBB   all present smokers                  16          28.1          3.562         57
      N. of subjects at Dt constr.               92                                      171
5-NP  SAA   non-smoker + non-respondent          57          100           1.0           57
      SPA   previous smoker, cess. P. ≥5Y        14          24.6          4.0           56
      SP5   previous smoker, cess. P. <5Y        5           8.8           11.0          55
      SBG   smoker, 15 cigarettes/d. or less     12          21.1          5.0           60
      SBH   heavy smoker, 16 cigs./d. or more    4           7.0           14.0          56
      N. of subjects at Dt constr.               92                                      284

Accuracy and Dt configuration (checked against the 92 real subjects)

NP    Balance node  N. wrong  Accuracy %  Dt 1st step  Dt 2nd step   Dt 3rd step
2-NP  without       0 #1      100         HA291        B469, HA291   B749, B124, -, B919
3-NP  without       12 #2     87.0        HA995        B494, B494    -, B105, B919, -
3-NP  with          11        88.0        B919         B369, B940    HA291, -, HA336, B168
5-NP  without       21 #3     77.2        HA995        B494, B494    B110, B106, HA778
5-NP  with          10        89.1        HA291        B919, HA291   B469, B754, B124, HA336

Smoking habit; R.Enz.: 27B+33HA+20M; NP: nominal partition; N.: number; constr.: construction; N. of real subjects: original dataset: 92; m. rate:
multiple rate at boosting the data; N. of apparent subjects: boosted subjects for Dt construction; N. wrong: number of wrongly classified subjects
among 92; #1 - #3: the corresponding cases marked in Table 2; the Dt steps belong to the whole Dt of each NP and are not related to the individual
marks, i.e., SAA - SBH; "-" at Dt configuration: missing OTU; Shading at '2-NP' indicates that the results have been reported previously [3], but
are shown here for comparison to 3-NP and 5-NP; Basic application schemes for balance nodes are shown in Fig. 2 as a flow chart.
Focusing on the effects of NPs, which were also
observed in Tables 2 and 3, the more NPs were applied,
the lower the accuracy that was generally obtained.
This provided valuable information about the selection
of related OTUs, and confirmed an effective and stable
method for DM processing. Moreover, we obtained in
parallel lists of the classified subjects situated in
the terminal Dt nodes. This means that individual
subjects can be classified or discriminated, as can be
seen in Appendix Fig. A1 with 5-NP.

The OTU of the 1st step indicated here in the figures
and tables was the OTU most closely related to the
assigned characteristics, and the 4th and 5th steps
were thought to show some indirect effects, such as
local effects in certain areas of OTUs. Focusing only
on the OTU of the 1st step, with increasing NP (3-NP or
more), lower DM accuracy was observed, as shown in
Tables 2 and 3, which is an essential problem of DM
processing. Specifically, 5-NP has 4 borders within the
dataset and gave worse accuracy than 2-NP, which has
only 1 border. The greater the number of NPs, the more
subjects are situated in the border zones of
partitions. Therefore, to obtain a clear and simple Dt
structure and stable OTUs, it is preferable to use
small NPs (2-NP or 3-NP) rather than large NPs (5-NP or
more). On the other hand, there is a remedy for large
NPs and unbalanced components: application of the
balance node, as shown in Fig. 2 and Table 6. However,
there is a limit to the improvement in accuracy because
of the inherent structure of a dataset, such as the
existence of subjects situated at the border zones.

1 Decision tree: decision supporting pathway that makes use of
a treelike graph, growing left to right.

2 Gini coefficient: g(t) is used for quantitative evaluation of
group impurity, and is defined at node t in C&RT as
$g(t) = \sum_{j \neq i} p(j \mid t)\, p(i \mid t)$, where i and j are categories of the
target field and p(i | t) is the proportion of category i at node t.

3 Terminal node: tree nodes that do not split further.
Appendix Fig. A1. Decision tree by 5-NP for smoking habit with 80 OTUs.

OTUs: 33HA+20M+27B; marked as $ in Table 2; large solid arrows: 3 nodes for heavy smokers, 'SBH' in Table 1; large
dotted arrow: node for 47 nonsmokers, 'SAA'; thin dotted arrows: misclassified subject(s) until 5th step, of which the
total number was 11, marked as $ in Table 2.



